<!doctype html>
<html>

<head>
    <title>FragmentVC Demo</title>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.1/css/bootstrap.min.css" integrity="sha384-VCmXjywReHh4PwowAiWNagnWcLhlEJLA5buUprzK8rxFgeH0kww/aWY76TfkUoSX" crossorigin="anonymous">
    <link rel="stylesheet" href="style.css">
    <link rel="apple-touch-icon" sizes="57x57" href="favicons/apple-icon-57x57.png">
    <link rel="apple-touch-icon" sizes="60x60" href="favicons/apple-icon-60x60.png">
    <link rel="apple-touch-icon" sizes="72x72" href="favicons/apple-icon-72x72.png">
    <link rel="apple-touch-icon" sizes="76x76" href="favicons/apple-icon-76x76.png">
    <link rel="apple-touch-icon" sizes="114x114" href="favicons/apple-icon-114x114.png">
    <link rel="apple-touch-icon" sizes="120x120" href="favicons/apple-icon-120x120.png">
    <link rel="apple-touch-icon" sizes="144x144" href="favicons/apple-icon-144x144.png">
    <link rel="apple-touch-icon" sizes="152x152" href="favicons/apple-icon-152x152.png">
    <link rel="apple-touch-icon" sizes="180x180" href="favicons/apple-icon-180x180.png">
    <link rel="icon" type="image/png" sizes="192x192" href="favicons/android-icon-192x192.png">
    <link rel="icon" type="image/png" sizes="32x32" href="favicons/favicon-32x32.png">
    <link rel="icon" type="image/png" sizes="96x96" href="favicons/favicon-96x96.png">
    <link rel="icon" type="image/png" sizes="16x16" href="favicons/favicon-16x16.png">
    <link rel="manifest" href="favicons/manifest.json">
    <meta name="msapplication-TileColor" content="#ffffff">
    <meta name="msapplication-TileImage" content="favicons/ms-icon-144x144.png">
    <meta name="theme-color" content="#ffffff">
</head>

<body>
    <div class="jumbotron jumbotron-fluid">
        <div class="container">
            <h1 class="display-5">Audio & Attention Map Demo</h1>
            <p class="lead">FRAGMENTVC: ANY-TO-ANY VOICE CONVERSION BY END-TO-END EXTRACTING AND FUSING FINE-GRAINED VOICE FRAGMENTS WITH ATTENTION</p>
            <hr class="my-4">
            <p>
                <b>Abstract:</b>
                Any-to-any voice conversion aims to convert the voice from and to any speakers even unseen during training, which is much more challenging compared to one-to-one or many-to-many tasks, but much more attractive in real-world scenarios.
                In this paper we proposed FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0, while the spectral features of the utterance(s) from the target speaker are obtained from log mel-spectrograms.
                By aligning the hidden structures of the two different feature spaces with a two-stage training process, FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance, all based on the attention mechanism of Transformer as verified with analysis on attention maps, and is accomplished end-to-end.
                This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information and doesn't require parallel data.
                Objective evaluation based on speaker verification and subjective evaluation with MOS both showed that this approach outperformed SOTA approaches, such as AdaIN-VC and AutoVC.
            </p>
            <p>
                <a href="https://github.com/yistLin/FragmentVC" type="button" class="btn btn-dark">GitHub (Source Code)</a>
                <a href="https://arxiv.org/abs/2010.14150" type="button" class="btn btn-danger">arXiv (Preprint)</a>
            </p>
        </div>
    </div>

    <div class="container">
        <h3>Seen-to-seen conversion</h3>

        In the following sections, there are 4 conversion pairs, each containing 4 speech utterances.
        The first 2 utterances are drawn from the CSTR VCTK Corpus:
        <ol type="a">
            <li>
                An utterance from the source speaker, termed as <u>source utterance</u>
            </li>
            <li>
                An utterance from the target speaker, which is of the same word transcription as the source utterance, termed as <u>authentic utterance</u>
            </li>
        </ol>

        And the rest 2 utterances are the conversion results using the <u>source utterance</u> as source:
        <ol start="3" type="a">
            <li>
                A synthetic utterance generated with the <u>authentic utterance</u> as target
            </li>
            <li>
                A synthetic utterance generated with 5 randomly sampled utterances from the target speaker as target
            </li>
        </ol>

        <div class="card mb-3">
            <div class="card-header text-white bg-secondary">Pair 1</div>
            <div class="card-body">
                <dl class="row">
                    <dt class="col-sm-3">
                        Source speaker
                    </dt>
                    <dd class="col-sm-9">
                        p225
                    </dd>
                    <dt class="col-sm-3">
                        Target speaker
                    </dt>
                    <dd class="col-sm-9">
                        p227
                    </dd>
                    <dt class="col-sm-3">
                        Transcription
                    </dt>
                    <dd class="col-sm-9">
                        &#8220;
                        Ask her to bring these things with her from the store.
                        &#8221;
                    </dd>
                    <dt class="col-sm-3">
                        Source utterance
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/p225_002.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Authentic utterance
                        <br>
                        <small>
                            from the target speaker
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/p227_002.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Conversion result
                        <br>
                        <small>
                            with the authentic utterance as target
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/s2s-f2m-p225_002-p227_002.wav">
                            Your browser does not support the audio tag!
                        </audio>
                        <br>
                        <img src="imgs/p225_p227/s2s-f2m-p225_002-p227_002.attn.png">
                    </dd>
                    <dt class="col-sm-3">
                        Conversion result
                        <br>
                        <small>
                            with 5 randomly sampled target utterances
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/s2s-f2m-p225_002-p227_001_010_025_052_090.wav">
                            Your browser does not support the audio tag!
                        </audio>
                        <br>
                        <img src="imgs/p225_p227/s2s-f2m-p225_002-p227_001_010_025_052_090.attn.png">
                    </dd>
                </dl>
            </div>
        </div>

        <div class="card mb-3">
            <div class="card-header text-white bg-secondary">Pair 2</div>
            <div class="card-body">
                <dl class="row">
                    <dt class="col-sm-3">
                        Source speaker
                    </dt>
                    <dd class="col-sm-9">
                        p227
                    </dd>
                    <dt class="col-sm-3">
                        Target speaker
                    </dt>
                    <dd class="col-sm-9">
                        p225
                    </dd>
                    <dt class="col-sm-3">
                        Transcription
                    </dt>
                    <dd class="col-sm-9">
                        &#8220;
                        Many complicated ideas about the rainbow have been formed.
                        &#8221;
                    </dd>
                    <dt class="col-sm-3">
                        Source utterance
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/p227_020.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Authentic utterance
                        <br>
                        <small>
                            from the target speaker
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/p225_020.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Conversion result
                        <br>
                        <small>
                            with the authentic utterance as target
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/s2s-m2f-p227_020-p225_020.wav">
                            Your browser does not support the audio tag!
                        </audio>
                        <br>
                        <img src="imgs/p227_p225/s2s-m2f-p227_020-p225_020.attn.png">
                    </dd>
                    <dt class="col-sm-3">
                        Conversion result
                        <br>
                        <small>
                            with 5 randomly sampled target utterances
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/s2s-m2f-p227_020-p225_001_010_027_030_038.wav">
                            Your browser does not support the audio tag!
                        </audio>
                        <br>
                        <img src="imgs/p227_p225/s2s-m2f-p227_020-p225_001_010_027_030_038.attn.png">
                    </dd>
                </dl>
            </div>
        </div>

        <div class="card mb-3">
            <div class="card-header text-white bg-secondary">Pair 3</div>
            <div class="card-body">
                <dl class="row">
                    <dt class="col-sm-3">
                        Source speaker
                    </dt>
                    <dd class="col-sm-9">
                        p228
                    </dd>
                    <dt class="col-sm-3">
                        Target speaker
                    </dt>
                    <dd class="col-sm-9">
                        p232
                    </dd>
                    <dt class="col-sm-3">
                        Transcription
                    </dt>
                    <dd class="col-sm-9">
                        &#8220;
                        Many complicated ideas about the rainbow have been formed.
                        &#8221;
                    </dd>
                    <dt class="col-sm-3">
                        Source utterance
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/p228_020.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Authentic utterance
                        <br>
                        <small>
                            from the target speaker
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/p232_020.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Conversion result
                        <br>
                        <small>
                            with the authentic utterance as target
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/s2s-f2m-p228_020-p232_020.wav">
                            Your browser does not support the audio tag!
                        </audio>
                        <br>
                        <img src="imgs/p228_p232/s2s-f2m-p228_020-p232_020.attn.png">
                    </dd>
                    <dt class="col-sm-3">
                        Conversion result
                        <br>
                        <small>
                            with 5 randomly sampled target utterances
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/s2s-f2m-p228_020-p232_001_010_027_030_038.wav">
                            Your browser does not support the audio tag!
                        </audio>
                        <br>
                        <img src="imgs/p228_p232/s2s-f2m-p228_020-p232_001_010_027_030_038.attn.png">
                    </dd>
                </dl>
            </div>
        </div>

        <div class="card mb-3">
            <div class="card-header text-white bg-secondary">Pair 4</div>
            <div class="card-body">
                <dl class="row">
                    <dt class="col-sm-3">
                        Source speaker
                    </dt>
                    <dd class="col-sm-9">
                        p232
                    </dd>
                    <dt class="col-sm-3">
                        Target speaker
                    </dt>
                    <dd class="col-sm-9">
                        p228
                    </dd>
                    <dt class="col-sm-3">
                        Transcription
                    </dt>
                    <dd class="col-sm-9">
                        &#8220;
                        Ask her to bring these things with her from the store.
                        &#8221;
                    </dd>
                    <dt class="col-sm-3">
                        Source utterance
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/p232_002.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Authentic utterance
                        <br>
                        <small>
                            from the target speaker
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/p228_002.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Conversion result
                        <br>
                        <small>
                            with the authentic utterance as target
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/s2s-m2f-p232_002-p228_002.wav">
                            Your browser does not support the audio tag!
                        </audio>
                        <br>
                        <img src="imgs/p232_p228/s2s-m2f-p232_002-p228_002.attn.png">
                    </dd>
                    <dt class="col-sm-3">
                        Conversion result
                        <br>
                        <small>
                            with 5 randomly sampled target utterances
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/s2s-m2f-p232_002-p228_001_010_025_052_090.wav">
                            Your browser does not support the audio tag!
                        </audio>
                        <br>
                        <img src="imgs/p232_p228/s2s-m2f-p232_002-p228_001_010_025_052_090.attn.png">
                    </dd>
                </dl>
            </div>
        </div>

        <h3>Unseen-to-unseen conversion</h3>

        In the following sections, there are 4 conversion pairs, each containing 4 speech utterances.
        The first 2 utterances are drawn from the CMU Arctic dataset:
        <ol type="a">
            <li>
                An utterance from the source speaker, termed as <u>source utterance</u>
            </li>
            <li>
                An utterance from the target speaker, which is of the same word transcription as the source utterance, termed as <u>authentic utterance</u>
            </li>
        </ol>

        <p>
            And the last one is the conversion result generated with the <u>source utterance</u> as source and 10 randomly sampled utterances from the target speaker as target.
        </p>

        <div class="card bg-light mb-3">
            <div class="card-header text-white bg-secondary">Pair 1</div>
            <div class="card-body">
                <dl class="row">
                    <dt class="col-sm-3">
                        Source speaker
                    </dt>
                    <dd class="col-sm-9">
                        slt
                    </dd>
                    <dt class="col-sm-3">
                        Target speaker
                    </dt>
                    <dd class="col-sm-9">
                        lnh
                    </dd>
                    <dt class="col-sm-3">
                        Transcription
                    </dt>
                    <dd class="col-sm-9">
                        &#8220;
                        The Warden with a quart of champagne.
                        &#8221;
                    </dd>
                    <dt class="col-sm-3">
                        Source utterance
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/slt_a0541.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Authentic utterance
                        <br>
                        <small>
                            from the target speaker
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/lnh_a0541.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        <big>
                            Conversion results 
                        </big>
                    </dt>
                    <dd class="col-sm-9">
                        <br>
                    </dd>
                    <dt class="col-sm-3">
                        FragmentVC
                        <br>
                        <small>
                            with 10 randomly sampled target utterances
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/u2u-f2f-slt_a0541-lnh.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        AdaIN
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/AdaIN_u2u-f2f-slt_a0541-lnh.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        AutoVC
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/AutoVC_u2u-f2f-slt_a0541-lnh.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                </dl>
            </div>
        </div>

        <div class="card bg-light mb-3">
            <div class="card-header text-white bg-secondary">Pair 2</div>
            <div class="card-body">
                <dl class="row">
                    <dt class="col-sm-3">
                        Source speaker
                    </dt>
                    <dd class="col-sm-9">
                        clb
                    </dd>
                    <dt class="col-sm-3">
                        Target speaker
                    </dt>
                    <dd class="col-sm-9">
                        rms
                    </dd>
                    <dt class="col-sm-3">
                        Transcription
                    </dt>
                    <dd class="col-sm-9">
                        &#8220;
                        The scents of strange vegetation blew off the tropic land.
                        &#8221;
                    </dd>
                    <dt class="col-sm-3">
                        Source utterance
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/clb_b0503.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Authentic utterance
                        <br>
                        <small>
                            from the target speaker
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/rms_b0503.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        <big>
                            Conversion results 
                        </big>
                    </dt>
                    <dd class="col-sm-9">
                        <br>
                    </dd>
                    <dt class="col-sm-3">
                        FragmentVC
                        <br>
                        <small>
                            with 10 randomly sampled target utterances
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/u2u-f2m-clb_b0503-rms.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        AdaIN
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/AdaIN_u2u-f2m-clb_b0503-rms.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        AutoVC
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/AutoVC_u2u-f2m-clb_b0503-rms.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                </dl>
            </div>
        </div>


        <div class="card bg-light mb-3">
            <div class="card-header text-white bg-secondary">Pair 3</div>
            <div class="card-body">
                <dl class="row">
                    <dt class="col-sm-3">
                        Source speaker
                    </dt>
                    <dd class="col-sm-9">
                        bdl
                    </dd>
                    <dt class="col-sm-3">
                        Target speaker
                    </dt>
                    <dd class="col-sm-9">
                        ljm
                    </dd>
                    <dt class="col-sm-3">
                        Transcription
                    </dt>
                    <dd class="col-sm-9">
                        &#8220;
                        The woman in you is only incidental, accidental, and irrelevant.
                        &#8221;
                    </dd>
                    <dt class="col-sm-3">
                        Source utterance
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/bdl_a0283.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Authentic utterance
                        <br>
                        <small>
                            from the target speaker
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/ljm_a0283.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        <big>
                            Conversion results 
                        </big>
                    </dt>
                    <dd class="col-sm-9">
                        <br>
                    </dd>
                    <dt class="col-sm-3">
                        FragmentVC
                        <br>
                        <small>
                            with 10 randomly sampled target utterances
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/u2u-m2f-bdl_a0283-ljm.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        AdaIN
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/AdaIN_u2u-m2f-bdl_a0283-ljm.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        AutoVC
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/AutoVC_u2u-m2f-bdl_a0283-ljm.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                </dl>
            </div>
        </div>

        <div class="card bg-light mb-3">
            <div class="card-header text-white bg-secondary">Pair 4</div>
            <div class="card-body">
                <dl class="row">
                    <dt class="col-sm-3">
                        Source speaker
                    </dt>
                    <dd class="col-sm-9">
                        rms
                    </dd>
                    <dt class="col-sm-3">
                        Target speaker
                    </dt>
                    <dd class="col-sm-9">
                        bdl
                    </dd>
                    <dt class="col-sm-3">
                        Transcription
                    </dt>
                    <dd class="col-sm-9">
                        &#8220;
                        Bassett was a fastidious man.
                        &#8221;
                    </dd>
                    <dt class="col-sm-3">
                        Source utterance
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/rms_a0296.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        Authentic utterance
                        <br>
                        <small>
                            from the target speaker
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/bdl_a0296.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        <big>
                            Conversion results 
                        </big>
                    </dt>
                    <dd class="col-sm-9">
                        <br>
                    </dd>
                    <dt class="col-sm-3">
                        FragmentVC
                        <br>
                        <small>
                            with 10 randomly sampled target utterances
                        </small>
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/u2u-m2m-rms_a0296-bdl.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        AdaIN
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/AdaIN_u2u-m2m-rms_a0296-bdl.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                    <dt class="col-sm-3">
                        AutoVC
                    </dt>
                    <dd class="col-sm-9">
                        <audio controls>
                            <source src="wavs/AutoVC_u2u-m2m-rms_a0296-bdl.wav">
                            Your browser does not support the audio tag!
                        </audio>
                    </dd>
                </dl>
            </div>
        </div>

    </div>

    <div class="container" style="padding-top: 60px;">
        <p class="text-center text-muted">&copy; 台大語音實驗室 NTU Speech Lab</p>
    </div>
</body>

</html>